Let’s start by reading in the packages we’ll need, setting the working directory and reading in the red wine data.
Let’s look at the first few entries to see what the data looks like, and see how many samples of the data we have.
head(reds)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
dim(reds)
## [1] 1599 13
The first thing is to look at the distribution of wines vs. quality. We see a roughly normal distribution, perhaps slightly skewed. In particular, there are very few wines in the highest quality bins. This makes sense, since high-quality wines are relatively rare and difficult to create.
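The plot itself is not reproduced here, but the bin counts can be recovered from numbers that do appear in this write-up: the class proportions printed under "Prior probabilities of groups" in the LDA output further down, and the 1599 rows reported by dim(reds). A minimal sketch (the variable names are mine, not from the original code):

```r
# Class proportions copied from the "Prior probabilities of groups" LDA output
priors <- c("3" = 0.006253909, "4" = 0.033145716, "5" = 0.425891182,
            "6" = 0.398999375, "7" = 0.124452783, "8" = 0.011257036)
n_wines <- 1599  # number of rows reported by dim(reds)

# Recover the number of wines in each quality bin
quality_counts <- round(priors * n_wines)
print(quality_counts)

# Share of all wines that fall in the two middle bins (5 and 6)
middle_share <- sum(quality_counts[c("5", "6")]) / n_wines
print(round(middle_share, 3))

# Simple bar chart of the distribution
barplot(quality_counts, xlab = "Quality", ylab = "Number of wines",
        main = "Red wines per quality bin")
```

This makes the imbalance concrete: the bins at either extreme hold only a handful of wines each, which is worth keeping in mind when reading the per-bin boxplots.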
Next, we’ll look at the distributions for each of the chemical properties of wine, first for all wines, and then in a boxplot sorted by quality. This should give us a good idea of how each property is distributed overall and how this distribution varies with wine quality.
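The plotting code is not shown in this write-up, so the following is only a sketch of how one histogram-plus-boxplot pair could be produced in base R. The helper name plot_property and the toy data frame are hypothetical; with the real data you would pass reds instead.

```r
# Hypothetical helper: histogram of one property next to a boxplot by quality
plot_property <- function(df, property) {
  op <- par(mfrow = c(1, 2))   # two panels side by side
  on.exit(par(op))             # restore the layout afterwards
  hist(df[[property]], main = property, xlab = property)
  boxplot(df[[property]] ~ df$quality, xlab = "quality", ylab = property)
  # Return the per-bin medians, i.e. the centre lines of the boxes
  tapply(df[[property]], df$quality, median)
}

# Toy stand-in for `reds` (made-up numbers, not the real data)
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.8, 9.5, 10.5, 10.2, 11.0, 11.5, 12.0, 11.8)
)
medians <- plot_property(toy, "alcohol")
print(medians)
```

Returning the medians alongside the plot is a convenience: the comments below mostly compare medians across bins, so having the numbers avoids reading them off the chart.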
The median of the fixed acidity increases with wine quality, though there are a number of outliers with large fixed acidity in the middle quality bins. Of course, there are a lot more samples in those bins.
The distribution is normal and unimodal, but with a fatter tail on the high side.
Here, the relationship is pretty clear. The low quality wines have a high median volatile acidity, whereas high-quality wines have much less.
Volatile acidity measures acetic acid (vinegar) and other impurities, so this relationship makes sense.
The distribution is not far off–could be a noisy normal or bimodal type distribution.
Median citric acid increases with wine quality, although there seem to be a bunch of outliers in quality bin 7 (all zero). Possible measurement error?
Citric acid is also the one distribution that is clearly non-normal–it’s multi-modal and skewed toward the low side.
Residual sugar seems to be roughly equivalent for the different quality bins, although far outliers in the middle bins make the scale hard to read.
The overall distribution shows that values are normal around the mean/median, but with far outliers on the high side.
There appears to be a trend of decreasing chlorides in the higher quality bins, but this is a bit obscured by a number of far outliers that affect the scale.
As with residual sugar, the overall distribution for chlorides is normal around the mean/median, with far outliers on the high side.
The median is largest in the mid-quality wines and lower at each extreme. The outliers are less extreme than for the previous two properties.
The overall distribution is unimodal, but somewhat asymmetric.
Total sulfur dioxide has a similar profile to free sulfur dioxide. Might be good to look for a correlation here.
The overall distribution is similar to free sulfur dioxide.
Median density declines with wine quality, especially in the highest bins.
The overall distribution of density is roughly normal, but with fatter tails.
Median pH also declines with wine quality, but the ranges have a lot of overlap.
The overall distribution of pH is similar to density (normal w/fat tails) but a bit noisier.
Just when you were getting bored, here’s another clear relationship. Median sulphates increase with increasing quality.
The overall distribution is normal-ish, with a long tail on the high end.
Low quality wines have a relatively low alcohol content, but this goes up in bins 6 and above. A low alcohol level could be a symptom of wine going to vinegar, couldn’t it (although there could be other factors)? Good to check for correlation to volatile acidity.
Overall distribution is unimodal, but with a giant, noisy tail on the high end.
As a final step in our univariate analysis, let’s look at how strongly each property is correlated to wine quality:
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## 0.01373164 -0.12890656 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## 0.25139708 0.47616632 1.00000000
This shows the strongest positive correlation to alcohol content, and the strongest negative correlation to volatile acidity, which makes sense.
Other factors with relatively high correlations to quality (+ or -) are sulphates and citric acid.
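The code behind the correlation table is not shown, so here is one way it could be computed. This is a sketch on synthetic data: the column names match the wine data, but the numbers are simulated, so the values will not match the table above. With the real data the call would presumably be something like cor(reds[2:13], reds$quality).

```r
set.seed(1)
n <- 200

# Synthetic stand-in for `reds`: made-up numbers, only the column names match
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.0),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  residual.sugar   = rnorm(n, mean = 2.5,  sd = 1.4)
)
# Simulated quality: rises with alcohol, falls with volatile acidity
toy$quality <- round(5.6 + 0.4 * (toy$alcohol - 10.4) -
                       1.5 * (toy$volatile.acidity - 0.53) +
                       rnorm(n, sd = 0.6))

# Pearson correlation of every column with quality, strongest first
quality_cor <- cor(toy, toy$quality)[, 1]
quality_cor <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(quality_cor, 3))
```

Sorting by absolute value puts the strongest relationships at the top regardless of sign, which is how the table above is being read.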
Now that we have looked at all of the properties individually and inspected how they vary in the different quality bins, it’s time to look at how different combinations of properties might affect the quality of a wine.
The first thing that popped out at me is that many of these properties are related, just because of chemistry. So, I did a correlation heatplot to find variables that are not completely independent.
Here it is:
Some things to notice here:
So, the conclusion here is that many of the properties give redundant information. Maybe we can decrease the dimensionality somehow.
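To make "redundant information" concrete, one option is to list the pairs of properties whose correlation exceeds some cutoff and then draw the heatmap from the same matrix. The sketch below uses simulated data and base R only; the 0.5 cutoff and all variable names other than the column names are my assumptions, not taken from the original analysis.

```r
set.seed(42)
n <- 300

# Simulated stand-in: total sulfur dioxide is built to include the free part
free <- rexp(n, rate = 1 / 15)
toy <- data.frame(
  free.sulfur.dioxide  = free,
  total.sulfur.dioxide = free + rexp(n, rate = 1 / 15),
  pH                   = rnorm(n, mean = 3.3, sd = 0.15)
)

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above a hypothetical cutoff
idx <- which(upper.tri(cm) & abs(cm) > 0.5, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(strong_pairs)

# Heatmap of the same matrix, without reordering the variables
heatmap(cm, symm = TRUE, Rowv = NA, main = "Correlation heatmap")
```

A table of flagged pairs is easier to read than a scrunched plot, and it gives a short list of candidates to drop or combine before any dimensionality reduction.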
Note: I’m sorry the graph is so scrunched in the PDF. It looks great on a large monitor.
The next step is to look at two-dimensional scatterplots and see the relationships between some of the properties that seem most important in determining the quality of a red wine.
Let’s look at a scatterplot matrix of some of the more significant features.
Anyhow, this doesn’t show us too much more than we saw in the heatmap. Along the diagonal you can see that the distribution of most of the properties is roughly normal (Gaussian). Citric acid is a notable exception.
This may affect the ways we choose to analyze the data.
Volatile acidity v. alcohol shows clear trends: higher quality wines have higher alcohol and lower volatile acidity.
This graph is interesting in that it shows a marked clustering of wines in quality bin 5 at low alcohol and high volatile acidity. Bin 5 is the lowest quality bin with a significant number of samples in it.
##### Sulfur dioxide v. volatile acidity
This looks like one of the better scatterplots for separating high quality from low quality wines. Again, we see a bunch of bin 5 wines clustered at low alcohol.
Since most of the properties are normally distributed, I decided to try a principal components analysis to reduce the dimensionality of the wine dataset.
I got some code from the interwebs for this one (can’t find the exact reference).
But the code and results seem reasonable, so here goes
The first step is to standardize the data (center each variable to mean 0 and scale it to standard deviation 1), then run the PCA.
wine <- reds
s <- as.data.frame(scale(wine[2:12]))
wine.pca <- prcomp(s)
Here is a summary of the results:
summary(wine.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7604 1.3878 1.2452 1.1015 0.97943 0.81216 0.76406
## Proportion of Variance 0.2817 0.1751 0.1410 0.1103 0.08721 0.05996 0.05307
## Cumulative Proportion 0.2817 0.4568 0.5978 0.7081 0.79528 0.85525 0.90832
## PC8 PC9 PC10 PC11
## Standard deviation 0.65035 0.58706 0.42583 0.24405
## Proportion of Variance 0.03845 0.03133 0.01648 0.00541
## Cumulative Proportion 0.94677 0.97810 0.99459 1.00000
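As a sanity check on the summary above, the "Proportion of Variance" and "Cumulative Proportion" rows can be rebuilt from the "Standard deviation" row alone: each component's variance is its standard deviation squared, divided by the total. The sketch below just copies the printed standard deviations (the variable names are mine).

```r
# Standard deviations copied from the summary(wine.pca) output above
sdev <- c(1.7604, 1.3878, 1.2452, 1.1015, 0.97943, 0.81216,
          0.76406, 0.65035, 0.58706, 0.42583, 0.24405)

variances <- sdev^2                      # variance captured by each component
prop_var  <- variances / sum(variances)  # "Proportion of Variance" row
cum_prop  <- cumsum(prop_var)            # "Cumulative Proportion" row

# Total variance is close to 11, the number of standardized variables
print(round(sum(variances), 2))
print(round(prop_var[1:5], 4))
print(round(cum_prop[1:5], 4))

# Smallest number of components explaining at least 70% of the variance
n_components_70 <- which(cum_prop >= 0.70)[1]
print(n_components_70)
```

The total coming out at about 11 is expected: after standardization every one of the 11 variables contributes a variance of 1.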
And a screeplot, which shows how much of the variance each of the new components accounts for.
screeplot(wine.pca, type="lines")
There isn’t a real cutoff in the screeplot, but it is clear that the first 4-5 principal components account for most of the variance (roughly 71% to 80%, per the cumulative proportions above). So let’s have a look at them.
The first PA seems to relate to general acidity. It has a weakly positive relationship to quality.
wine.pca$rotation[,1]
## fixed.acidity volatile.acidity citric.acid
## 0.48931422 -0.23858436 0.46363166
## residual.sugar chlorides free.sulfur.dioxide
## 0.14610715 0.21224658 -0.03615752
## total.sulfur.dioxide density pH
## 0.02357485 0.39535301 -0.43851962
## sulphates alcohol
## 0.24292133 -0.11323206
first_pa <- wine.pca$x[, 1]
scatterplot(first_pa ~ wine$quality,
xlab="Wine quality",
ylab="First PA",
main="Wine quality vs first PA (axis of acidity)",
labels=row.names(wine))
lm(first_pa~wine$quality)
##
## Call:
## lm(formula = first_pa ~ wine$quality)
##
## Coefficients:
## (Intercept) wine$quality
## -1.3558 0.2406
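It may help to spell out what wine.pca$x[, 1] actually contains: each wine's score is the dot product of its standardized measurements with the loading vector printed by wine.pca$rotation[, 1]. The sketch below checks this on R's built-in iris data as a stand-in, since the wine file is not bundled here; the same identity should hold for s and wine.pca.

```r
# Stand-in data: the four numeric columns of R's built-in iris data set
s <- as.data.frame(scale(iris[1:4]))
pca <- prcomp(s)

# Scores by hand: standardized data times the loading vector of PC1
manual_pc1 <- as.matrix(s) %*% pca$rotation[, 1]

# Compare with the scores stored by prcomp()
max_diff <- max(abs(manual_pc1[, 1] - pca$x[, 1]))
print(max_diff)

# Loadings have unit length, so they only describe a direction
loading_length <- sqrt(sum(pca$rotation[, 1]^2))
print(loading_length)
```

Because a loading vector only defines a direction, its overall sign is arbitrary. A component and its regression slope against quality can both flip sign between runs or software versions without changing the interpretation.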
The second PA has high sulfur dioxide, high volatile acids and low alcohol (yuk). It falls off substantially in the higher-quality wines.
wine.pca$rotation[,2]
## fixed.acidity volatile.acidity citric.acid
## -0.110502738 0.274930480 -0.151791356
## residual.sugar chlorides free.sulfur.dioxide
## 0.272080238 0.148051555 0.513566812
## total.sulfur.dioxide density pH
## 0.569486959 0.233575490 0.006710793
## sulphates alcohol
## -0.037553916 -0.386180959
second_pa <- wine.pca$x[, 2]
scatterplot(second_pa ~ wine$quality,
xlab="Wine quality",
ylab="Second PA",
main="Wine quality vs second PA (axis of funk)",
labels=row.names(wine))
lm(second_pa~wine$quality)
##
## Call:
## lm(formula = second_pa ~ wine$quality)
##
## Coefficients:
## (Intercept) wine$quality
## 3.7463 -0.6647
The third PA is characterized by high volatile acidity and low alcohol. It also has low sulfur dioxide, although the meaning of this is less clear. It is inversely related to wine quality. Basically, vinegar.
third_pa <- wine.pca$x[, 3]
wine.pca$rotation[,3]
## fixed.acidity volatile.acidity citric.acid
## 0.12330157 0.44996253 -0.23824707
## residual.sugar chlorides free.sulfur.dioxide
## -0.10128338 0.09261383 -0.42879287
## total.sulfur.dioxide density pH
## -0.32241450 0.33887135 -0.05769735
## sulphates alcohol
## -0.27978615 -0.47167322
lm(third_pa~wine$quality)
##
## Call:
## lm(formula = third_pa ~ wine$quality)
##
## Coefficients:
## (Intercept) wine$quality
## 3.4698 -0.6156
scatterplot(third_pa ~ wine$quality,
xlab="Wine quality",
ylab="Third PA ",
main="Wine quality vs third PA (axis of vinegar)")
The fourth PA is very boring. Nothing strongly in the mix and no noticeable effect on quality.
fourth_pa <- wine.pca$x[, 4]
wine.pca$rotation[,4]
## fixed.acidity volatile.acidity citric.acid
## -0.229617370 0.078959783 -0.079418256
## residual.sugar chlorides free.sulfur.dioxide
## -0.372792562 0.666194756 -0.043537818
## total.sulfur.dioxide density pH
## -0.034577115 -0.174499758 -0.003787746
## sulphates alcohol
## 0.550872362 -0.122181088
scatterplot(fourth_pa ~ wine$quality,
xlab="Wine quality",
ylab="Fourth PA ",
main="Wine quality vs fourth PA (axis of nothingburger)")
lm(fourth_pa~wine$quality)
##
## Call:
## lm(formula = fourth_pa ~ wine$quality)
##
## Coefficients:
## (Intercept) wine$quality
## 0.33946 -0.06023
We’ll look at one more PA. This one seems to be characterized mostly by a lack of residual sugar, and, again, the effect on quality is minor.
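# Scores for the fifth component are in column 5. The original code indexed column 4,
# which is why the lm() coefficients printed below are identical to the fourth PA's.
fifth_pa <- wine.pca$x[, 5]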
wine.pca$rotation[,5]
## fixed.acidity volatile.acidity citric.acid
## 0.08261366 -0.21873452 0.05857268
## residual.sugar chlorides free.sulfur.dioxide
## -0.73214429 -0.24650090 0.15915198
## total.sulfur.dioxide density pH
## 0.22246456 -0.15707671 -0.26752977
## sulphates alcohol
## -0.22596222 -0.35068141
scatterplot(fifth_pa ~ wine$quality,
xlab="Wine quality",
ylab="Fifth PA ",
main="Wine quality vs fifth PA (axis of dryness)")
lm(fifth_pa~wine$quality)
##
## Call:
## lm(formula = fifth_pa ~ wine$quality)
##
## Coefficients:
## (Intercept) wine$quality
## 0.33946 -0.06023
Next is a linear discriminant analysis for the wine data. It should show the main axis that determines quality as a function of all the other properties.
It looks like the LD1 accounts for most of the quality variation, but the scatterplot shows that there is too much overlap in the distributions to be able to reliably sort out any but the highest and lowest quality wines.
Although there is overlap in the distributions, this does look like a reasonable measure of quality.
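One problem: there seems to be an anomaly in the density, whose coefficients are in the hundreds. The likely cause is scale: the lda() call below uses the raw, unscaled columns rather than the scaled data frame s, and density varies only in the third decimal place, so its coefficients have to be large to have any effect.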
wine <- reds
library('MASS')
wine_features <- wine
wine_features$quality <- NULL
s <- as.data.frame(scale(wine_features))
wine.lda <-
lda(wine$quality ~ wine$fixed.acidity + wine$volatile.acidity+ wine$citric.acid + wine$residual.sugar + wine$chlorides + wine$free.sulfur.dioxide + wine$total.sulfur.dioxide + wine$density + wine$pH + wine$sulphates + wine$alcohol)
wine.lda
## Call:
## lda(wine$quality ~ wine$fixed.acidity + wine$volatile.acidity +
## wine$citric.acid + wine$residual.sugar + wine$chlorides +
## wine$free.sulfur.dioxide + wine$total.sulfur.dioxide + wine$density +
## wine$pH + wine$sulphates + wine$alcohol)
##
## Prior probabilities of groups:
## 3 4 5 6 7 8
## 0.006253909 0.033145716 0.425891182 0.398999375 0.124452783 0.011257036
##
## Group means:
## wine$fixed.acidity wine$volatile.acidity wine$citric.acid
## 3 8.360000 0.8845000 0.1710000
## 4 7.779245 0.6939623 0.1741509
## 5 8.167254 0.5770411 0.2436858
## 6 8.347179 0.4974843 0.2738245
## 7 8.872362 0.4039196 0.3751759
## 8 8.566667 0.4233333 0.3911111
## wine$residual.sugar wine$chlorides wine$free.sulfur.dioxide
## 3 2.635000 0.12250000 11.00000
## 4 2.694340 0.09067925 12.26415
## 5 2.528855 0.09273568 16.98385
## 6 2.477194 0.08495611 15.71160
## 7 2.720603 0.07658794 14.04523
## 8 2.577778 0.06844444 13.27778
## wine$total.sulfur.dioxide wine$density wine$pH wine$sulphates
## 3 24.90000 0.9974640 3.398000 0.5700000
## 4 36.24528 0.9965425 3.381509 0.5964151
## 5 56.51395 0.9971036 3.304949 0.6209692
## 6 40.86991 0.9966151 3.318072 0.6753292
## 7 35.02010 0.9961043 3.290754 0.7412563
## 8 33.44444 0.9952122 3.267222 0.7677778
## wine$alcohol
## 3 9.955000
## 4 10.265094
## 5 9.899706
## 6 10.629519
## 7 11.465913
## 8 12.094444
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## wine$fixed.acidity 0.15576218 -0.510826253 -0.13230726
## wine$volatile.acidity -2.14869965 -5.169157664 -2.80464132
## wine$citric.acid -0.24353923 -1.810902037 -3.67023468
## wine$residual.sugar 0.09907188 -0.310654752 -0.27785760
## wine$chlorides -4.49075830 -3.286220068 4.88913726
## wine$free.sulfur.dioxide 0.01015280 0.002518588 0.05746815
## wine$total.sulfur.dioxide -0.01066123 0.015340541 -0.02412087
## wine$density -132.46861030 494.715527325 442.83751116
## wine$pH -0.27624041 -4.797254644 0.75289349
## wine$sulphates 2.55180806 -0.768377584 -0.57558078
## wine$alcohol 0.67697595 0.270197247 0.18108052
## LD4 LD5
## wine$fixed.acidity -1.151995674 1.826028e-01
## wine$volatile.acidity 2.625991390 -2.404376e+00
## wine$citric.acid 1.097971759 -2.639100e+00
## wine$residual.sugar -0.399628931 4.467281e-01
## wine$chlorides -8.619928322 -7.425094e+00
## wine$free.sulfur.dioxide 0.020697814 -5.792837e-02
## wine$total.sulfur.dioxide -0.009733169 7.784032e-03
## wine$density 569.905215399 -4.286164e+02
## wine$pH -8.470107031 2.487246e+00
## wine$sulphates 0.055588624 8.097894e-01
## wine$alcohol 0.800344973 -5.659672e-01
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5
## 0.8496 0.1028 0.0333 0.0086 0.0056
# Do a prediction
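# Predict on the data the model was fit to. The original passed s$quality,
# which is NULL because the quality column was dropped before scaling.
wine.lda.values <- predict(wine.lda)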
first_lda <- wine.lda.values$x[,1]
scatterplot(wine$quality, wine.lda.values$x[,1])
# ldahist(data = wine.lda.values$x[,1], g=wine$quality)
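The very large density coefficients in the output above are what one would expect when lda() is fit on raw columns with very different spreads. The sketch below illustrates the point on R's built-in iris data (a stand-in, with Species playing the role of quality): standardizing the predictors changes the coefficients but leaves the discriminant scores and predicted classes unchanged. It also passes data = to lda() so the formula can use plain column names, which is a style choice of mine rather than the original code.

```r
library(MASS)

# Stand-in data: iris, with Species playing the role of quality
raw    <- iris
scaled <- iris
scaled[1:4] <- as.data.frame(scale(iris[1:4]))  # mean 0, sd 1 per predictor

fit_raw    <- lda(Species ~ ., data = raw)
fit_scaled <- lda(Species ~ ., data = scaled)

# The coefficients differ between the two fits...
print(fit_raw$scaling)
print(fit_scaled$scaling)

# ...but the discriminant scores and predicted classes do not
pred_raw    <- predict(fit_raw)
pred_scaled <- predict(fit_scaled)
same_class  <- all(pred_raw$class == pred_scaled$class)

# abs() because the sign of each discriminant axis is arbitrary
score_gap <- max(abs(abs(pred_raw$x) - abs(pred_scaled$x)))
print(c(same_class = same_class, score_gap = score_gap))
```

So the density "anomaly" should not affect the LD1 scatterplot. Refitting on standardized columns would mainly make the coefficient table easier to compare across properties.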
In the end, if the PCA analysis has some validity, you can see that two of the principal differences in wine were less important in determining quality: the general acidity and the residual sugar.
But two others (PA2 and PA3) seem to be signatures of problems in winemaking. PA3, with its high volatile acidity and low alcohol, is probably related to wine going to vinegar.
PA2 has some volatile acidity and low alcohol, but is mainly distinguished by a high total sulfur dioxide. This can also lend an off-taste to wine.
If you look at a plot of alcohol vs. total sulfur dioxide, you see an interesting cluster of wines of quality 5 at low alcohol and relatively high sulfur dioxide. I must say that I have no idea what this means.
Here is a useful summary of wine faults: https://wine.appstate.edu/sites/wine.appstate.edu/files/Chart%20Aromas%20FH_0.pdf
Tolstoy said that all happy families are alike, but each unhappy family is unhappy in its own way. Perhaps not true for families, but true enough for wine, at least for the wine in the top bins vs. the wines in the middle and lower bins.
What distinguishes the higher quality wine is the absence of wine faults. It is fairly easy to distinguish the best wine from the others by the absence of these faults. PCA/LDA analysis shows some possible combinations of features that could be a signature of wine faults, but of course it’s just exploratory.
As for the wines in the middle quality bins (that is, the overwhelming majority of the wine samples), the picture becomes more hazy because there are many types of wine fault, so wines can be less-than-perfect in many different ways, in different degrees and different combinations. I’m not sure if the mid-quality wines are blended, in which case you’d expect wines with complementary faults to be mixed together, which would muddy the water still more.
Anyhow, nice project, more fun than I thought.